Sparsifying priors for Bayesian uncertainty quantification in model discovery

We propose a probabilistic model discovery method for identifying ordinary differential equations governing the dynamics of observed multivariate data. Our method is based on the sparse identification of nonlinear dynamics (SINDy) framework, where models are expressed as sparse linear combinations of pre-specified candidate functions. Promoting parsimony through sparsity leads to interpretable models that generalize to unknown data. Instead of targeting point estimates of the SINDy coefficients, we estimate these coefficients via sparse Bayesian inference. The resulting method, uncertainty quantification SINDy (UQ-SINDy), quantifies not only the uncertainty in the values of the SINDy coefficients due to observation errors and limited data, but also the probability of inclusion of each candidate function in the linear combination. UQ-SINDy promotes robustness against observation noise and limited data, interpretability (in terms of model selection and inclusion probabilities) and generalization capacity for out-of-sample forecast. Sparse inference for UQ-SINDy employs Markov chain Monte Carlo, and we explore two sparsifying priors: the spike and slab prior, and the regularized horseshoe prior. UQ-SINDy is shown to discover accurate models in the presence of noise and with orders-of-magnitude less data than current model discovery methods, thus providing a transformative method for real-world applications which have limited data.


Introduction
In recent years, there has been a rapid increase in measurements gathered from complex nonlinear dynamics for which their governing equations are unknown. A key challenge is to discover explicit representations of these equations, which can then be used for system identification, forecasting and control. Measurements are often compromised by noise or may exhibit chaotic behaviour, in which case it is critical to quantify how uncertainty affects the model discovery process. To address this challenge, we introduce the uncertainty quantification sparse identification of nonlinear dynamics (UQ-SINDy) framework, which leverages sparsity promotion in a Bayesian probabilistic setting to extract a parsimonious set of governing equations. Our method provides uncertainty estimates of both the parameter values and the inclusion probabilities for different terms in the models.
Discovery of governing equations plays a fundamental role in the development of physical theories. With increasing computing power and data availability in recent years, there have been substantial efforts to identify the governing equations directly from data [1][2][3]. There has been particular emphasis on parsimonious representations because they have the benefits of promoting interpretability and generalizing well to unknown data [4][5][6][7][8][9][10][11]. The SINDy method was proposed in [5], which leverages dictionary learning and sparse regression to model dynamical systems. This approach has been successful in modelling a diversity of applications, including in chemistry [12], optics [13], engineered systems [14], epidemiology [15] and plasma physics [16]. Furthermore, there has been a variety of modifications, including improved robustness to noise [17][18][19], generalizations to partial differential equations [20][21][22], multi-scale physics [23] and libraries of rational functions [24,25].
Although these methods identify the equations, measurements often contain observation errors, which may imperil the predictive capacity of learned models. A common approach to remedy this is to use the Bayesian probability framework where uncertainty is quantified in terms of probability and where priors are employed to encode assumptions and knowledge about model parameters [26,27]. Bayesian methods have been widely used for uncertainty quantification in time series models, with applications to weather forecasting [28][29][30], disease modelling [31][32][33], traffic flow [34][35][36] and finance [37][38][39], among many others. By leveraging informative priors Bayesian inference can achieve better results for limited data compared to frequentist approaches [40][41][42]. More recently, these methods have been incorporated into model discovery frameworks, exhibiting state-of-the-art performance for system identification in the presence of noise [3,43,44]. Although these methods provide a range of possible values, realizations of these models are in general not sparse and consequently lack the capability to identify relevant terms in the model. Sparse regression is a popular tool to identify a small subset of variables that explain the data. However, finding the true minimum is computationally intractable in practice. In the frequentist setting, a popular solution is to use the Lasso, which corresponds to an l 1 penalty term [45]. In the Bayesian setting, sparsity is generated by fundamentally different mechanisms. Most notably, although the corresponding prior (the Laplace prior) shares the same maximum likelihood estimator as the Lasso [46], the distribution has fat tails and thus does not produce sparse realizations [47]. The spike and slab model remedies this by explicitly using Bernoulli variables to determine whether a term is present in the model, and has become the leading method for incorporating sparsity in the Bayesian framework [48][49][50]. One disadvantage to this prior, however, is its dependence on discrete variables, which makes inference prohibitively expensive for high-dimensional systems. Smooth approximations, such as the horseshoe [51,52], horseshoe+ [53], regularized horseshoe [54], Dirichlet-Laplace [55] and R2-D2 priors [56], have been shown to yield performance comparable to the spike and slab model. For this work, we will primarily focus on the regularized horseshoe prior, also known as the Finnish horseshoe.
In this work, we propose the UQ-SINDy framework, which provides uncertainty estimates of both the parameter value and inclusion probabilities and promotes sparsity in realizations of the model. This model leverages advances sparsity and Bayesian approaches for solving ordinary differential equations (ODEs) to achieve this goal. Importantly, the UQ-SINDy framework is capable of model discovery with limited and noisy data, improving significantly on existing methods by requiring orders-of-magnitude less data to identify a stable, robust and sparse model. Even with as little as 21 time points, the UQ-SINDy framework can identify such a model, making it an ideal tool for application areas like biology where often limited data is available. Indeed, it is the only model discovery method capable of working in this limited and noisy data regime. In § §2.1 and 2.2, we review the SINDy method and Bayesian inference for ordinary differential equations (ODEs), respectively. In §2.3, we review sparsity promoting priors, namely the spike and slab and regularized horseshoe priors, and compare their performance to the Laplace prior. In §3.1, we introduce two sparsity promoting Bayesian methods, spike and slab SINDy and regularized horseshoe SINDy. In §3.2, we illustrate these methods on two synthetic nonlinear datasets, a Lotka-Volterra model and nonlinear oscillator, and one real-world example of lynx and hare population data. We find that these methods are able to extract accurate and meaningful Bayesian models even in the presence of significant noise and sparse samples. These results are summarized and future improvements are discussed in §4.

Background
The UQ-SINDy framework is based on several recent developments in the fields of sparse regression, ODEs and Bayesian inference, and we review these contributions here. In §2.1, we introduce the SINDy algorithm, which employs sparse regression to identify governing equations in the frequentist setting. In §2.2, we review Bayesian inference for ODEs. In §2.3, we review three different priors for sparse inference-the Laplace, spike and slab, and regularized horseshoe priors-and compare their benefits and drawbacks.

Sparse identification of nonlinear dynamics
The SINDy method is a recently developed technique that leverages sparse regression to identify the governing equations from a given time series (figure 1). We consider a system with state xðtÞ ¼ ½x 1 ðtÞ, x 2 ðtÞ, . . . , x d ðtÞ`[ R d governed by the differential equation for some unknown function f : R d ! R d . The system's state is observed at the discrete times t = t 1 , …, t n . The goal of SINDy is to discover f from these observations. ! ...   Figure 1. Comparison of SINDy algorithm and UQ-SINDy. (Top) Schematic of SINDy algorithm. A dynamical system governed by unknown governing equations is measured. Next, we compute the derivative of the time series _ X and construct a library QðXÞ of candidate terms. Last, we perform sparse regression to identify the terms in the library that best explain the time series. (Bottom) Schematic of UQ-SINDy algorithm. A dynamical system governed by unknown governing equations is measured. Next, we posit a SINDy library QðX Þ of candidate terms. Last, we perform sparsity promoting Bayesian inference to compute the inclusion probability and the posterior distribution of each term in the SINDy library. An ensemble of reconstructions can then be computed, which quantify the credibility of predictions.
royalsocietypublishing.org/journal/rsos R. Soc. Open Sci. 9: 211823 To do so, we postulate that f can be written as a linear combination of a library of l candidate functions u i : R d ! R, i ∈ [1, l ]. For example, a commonly used library is the polynomial library Next, we define X ¼ xðt 1 Þ, xðt 2 Þ, . . . , xðt n Þ ½ [ R nÂd as the collection of observed state snapshots, and also define the matrix of library terms evaluated at the observation times, We then measure or compute the time derivative of the data X and solve the following equation for J [ R lÂd : where J denotes the matrix of linear combination coefficients, or SINDy coefficients. A key assumption of SINDy is that f may be represented by a small number of library terms, so that the matrix J is sparse. Thus, (2.1) is typically solved through sparse regression, using minimization techniques such as sequential least squares thresholding (STLSQ) [5], Lasso [45], or a relaxed formulation [18]. The SINDy procedure yields the set of identified nonlinear differential equations Once identified, this system of differential equations may be used for system identification, prediction and control.

Bayesian inference for data-driven discovery
Suppose we have the dataset ðX, yÞ for which we would like to fit the linear regression model where e N ð0, s 2 IÞ is a vector of independent, identically distributed Gaussian measurement noise with unknown standard deviation σ. In the Bayesian setting, our goal is to determine the posterior distribution of b and σ conditioned on the data, i.e. pðb, sjX, yÞ. To compute this distribution, we leverage Bayes' rule, pðb, sjX, yÞ / pðyjb, XÞpðsÞpðbÞ, where pðyjb, XÞ denotes the data likelihood, and p(σ) and pðbÞ denote the prior distribution of the noise standard deviation and the regression coefficients. These prior distributions incorporate any available domain knowledge about the distribution of the noise standard deviation and the β j s. In this work, we are interested in identifying ODE models from noisy data. In particular, given noisy time series and a SINDy model of the form _ x`¼ QðxÞJ, our goal is to compute the posterior distribution of the initial conditions x 0 and SINDy coefficients J. We assume that the dataset X consists of n noisy snapshots of the observed dynamics, that is X ¼ ½y 1 , y 2 , . . . , y n `[ R nÂd , where y i [ R d is the noisy snapshot of the system state at time t = t i . For a given probabilistic model of the observation noise, the data is modelled as deviations from the SINDy predictions; for example, for additive noise models, where e i denotes the additive noise for the ith snapshot. Bayes' rule then takes the form pðJ, x 0 , fjXÞ / pðXjJ, x 0 , fÞpðfÞpðJÞpðx 0 Þ, ð2:5Þ where f denotes auxiliary variables of the probabilistic model such as the noise standard deviation. The data likelihood pðXjJ, x 0 , fÞ is given by the chosen observation model (e.g. by (2.4) and the distribution of the noise for additive observation noise). Computing the posterior distribution (2.5) is in general not analytically tractable, in which case sampling-based methods such as Markov chain Monte Carlo (MCMC) may be used. Once the posterior distribution has been approximated, we may then compute state reconstructions and forecasts conditioned on the observed data [57,58]. Specifically, to estimate the distribution of predicted values of x at an arbitrary time t, we marginalize the data likelihood times the posterior royalsocietypublishing.org/journal/rsos R. Soc. Open Sci. 9: 211823 distribution over J, x 0 and f, that is, pðxðtÞjJ, x 0 , fÞpðJ, x 0 , fjXÞ dJ dx 0 df: ð2:6Þ The distribution pðxðtÞjXÞ is referred to as the posterior predictive distribution (PPD). The integral in (2.6) can be approximated via sampling by taking the expectation of the data likelihood over posterior samples drawn via MCMC.

Sparsity promoting priors
Consider the regression problem in (2.3). In many cases, we assume only a few components of x i are relevant for predicting y i , in which case we expect b to be sparse. In the Bayesian setting, multiple sparsity-inducing priors have been proposed. We describe a few of these approaches below, namely the Laplace, spike and slab, and regularized horseshoe priors.

Laplace prior
Originally proposed by Laplace [59], the Laplace distribution, also known as the double exponential distribution [26], corresponds to the probability distribution function (PDF) f (x|μ, b) given by Most notably, maximum a posteriori (MAP) estimation for this prior corresponds to regression with ℓ 1 regularization, that is [46], In the frequentist setting, solving this regression problem, known as the Lasso problem, has been shown to yield sparse solutions for b [45]. This sparsifying behaviour of the Laplace distribution is attributed to the fact that for values of x smaller than b, the distribution is sharply peaked, thus pushing many terms toward 0. Additionally, for values of x greater than b, the distribution has longer tails than the Gaussian distribution, allowing elements to escape significant shrinkage. Although l 1 regularization induces sparsity in the frequentist case, in the Bayesian setting realizations of the corresponding posterior distributions are not sparse [47]. In particular, in the Bayesian setting we must consider the whole distribution simultaneously. With the Laplace prior, every β j has probability mass simultaneously pushed toward and away from the origin, forcing relevant β j s to shrink toward the origin and irrelevant terms to have significant probability mass far away from the origin.
To illustrate this, we generate 400 data samples (x i , y i ) that satisfy (2.3), where x i N ð0, 1Þ [ R 10 , e i N ð0, 0:5 2 Þ, and b [ R 10 is chosen to be the sparse vector b ¼ ½0:3, 0:2, À 0:3, 0, 0, 0, 0, 0, 0, 0`: We perform Bayesian inference to estimate b using a Laplace prior, and we plot the resulting posterior distribution in figure 2. We note that when using the Laplace prior, the posterior distributions are centred about the true value b. However, many distributions are peaked at non-zero values, making it difficult to differentiate between relevant and irrelevant variables. Further, due to the wide widths of all the distributions, samples from this posterior distribution will not be sparse. To better enforce sparsity in a Bayesian setting and induce sparse realizations, the distribution of each β j must either be fully shrunk towards the origin or pushed away from the origin. In § §2.3.2 and 2.3.3 we discuss two priors that satisfy these properties.

Spike and slab prior
The spike and slab prior is one of the most popular sparsifying priors and is typically referred to as the 'gold standard' [48][49][50] sparsity-inducing prior in the Bayesian setting. For this prior, each β j is generated using the hierarchical model b j jl j N ð0, c 2 Þl j royalsocietypublishing.org/journal/rsos R. Soc. Open Sci. 9: 211823 and l j BerðpÞ, where BerðpÞ denotes the Bernoulli distribution with probability of success π. Here, π is the prior probability that λ j is 1. Otherwise λ j is 0. From this it can be seen that if λ j is 1, then the jth term belongs to the model and β j follows the 'slab' distribution, a normal distribution with variance c 2 . If λ j is 0, then the jth term is not in the model and β j follows the 'spike' distribution, a Dirac delta distribution centred at zero.
The distribution may be relaxed to where ε ≪ c. This is similar to before, except when λ j = 0, β j follows a narrow normal distribution with variance ε 2 . The spike and slab prior for b is very intuitive and has shown robust performance in practical applications. In figure 2, we plot the resulting posterior distribution for the example in §2.3.2. Most notably, we see that similar to the Laplace prior, the spike and slab prior extracts out wide distributions for the three non-zero coefficients. For the seven zero coefficients, on the other hand, the distribution is sharply spiked at the origin. Consequently, any samples drawn from this posterior distribution will be truly sparse. Compared to the Laplace distribution, the non-zero terms are much more easily identifiable. Furthermore, the mean of λ j corresponds to the estimate of the 'inclusion probability', that is the likelihood that a particular β j is relevant to the model.
Although the spike and slab prior has many beneficial properties, one downside is that because of its discrete nature, inference with this prior requires exploring the combinatorial space of possible models. To address this challenge, many smooth approximations to the spike and slab prior distribution have been proposed. We discuss one recent approach in §2.3.3.

Regularized horseshoe prior
The horseshoe prior and the recently developed regularized horseshoe prior are smooth priors that have shown comparable performance to the spike and slab model. The horseshoe is defined as the hierarchical prior  royalsocietypublishing.org/journal/rsos R. Soc. Open Sci. 9: 211823 where C + ( · , · ) denotes the half-Cauchy distribution [51,52,60]. The key intuition behind this prior is that τ promotes global sparsity, shrinking the posterior distributions of all β i s. The λ i s, known as the local shrinkage parameters, also have half-Cauchy priors, allowing some of the β i s to escape significant shrinkage. Many analyses have focused on choosing an optimal value for τ 0 , and in Piironen et al. values are recommended for sparse linear regression [54]. In this work, we employ τ 0 = 0.1 unless specified otherwise. We note that decreasing the value of τ 0 increases the sparsity of b estimates.
One downside of the horseshoe is that relevant terms that 'escape' shrinkage are not regularized, and thus elements of the posterior distribution may become arbitrarily large. In [54] it was proposed to include a small amount of regularization on each λ i , resulting in the regularized horseshoe prior where Inv-GammaðÁ, ÁÞ denotes the inverse Gamma distribution, and ν and s are parameters that control the shape of the slab. For small values of λ i , λ i τ ≪ c andl i ! l i , thus approximating the original horseshoe prior. However, for large values of λ i , λ i τ ≫ c andl i ! c=t, leading to β i being normally distributed with variance c 2 . This regularizes β i , constraining it to be on the order of c. In this work, we employ the values ν = 4 and s = 2.
We illustrate the performance of this prior in figure 2 for the example in §2.3.2. Similar to the spike and slab model, the non-zero coefficients have wide distributions. The zero coefficients, on the other hand are more spiked than those of the Laplace prior, thus resulting in sparser posterior realizations.
Unlike for the spike and slab prior, there is no explicit estimate for the inclusion probabilities. A popular alternative for identifying the relevant terms is to compute the shrinkage factor of the coefficients. Specifically, we compute the MAP estimateb Flat i with a flat prior (i.e. no prior) and compare it to the MAP estimate with the regularized horseshoe prior,b RH i . The ratio of these two values is called the shrinkage factor ð2:7Þ The shrinkage factor of the coefficients has been used to define inclusion 'pseudo-probabilities' for sparsity-promoting models [52,54,60]. We employ this approach in this work. In general, these ratios may not lie between 0 and 1. For our work, we have observed that computingb Flat i with flat priors is challenging. To remedy this, we use normal priors b i N ð0, 1Þ instead of flat priors. Further, we note that (2.7) can be computed directly from MAP estimates, without having to sample the full posterior distributions. Thus, the shrinkage factors can be estimated using optimization techniques instead of full Bayesian inference. However, in practice the associated optimization problems may be non-convex and highly sensitive to the initial guess. Consequently, for this work we use full Bayesian inference to estimate shrinkage factors.

UQ-SINDy
In this section, we combine advances in model discovery for dynamical systems and sparsity promoting Bayesian inference to propose the UQ-SINDy framework, which aims to quantify the uncertainty of estimated SINDy coefficients due to measurement, and to estimate the inclusion probabilities for each term in the SINDy library. In particular, within this framework we introduce two methods: spike and slab SINDy (ss-SINDy) and regularized horseshoe SINDy (rh-SINDy). The ss-SINDy method provides state-of-the-art performance for estimating uncertainty of coefficients and inclusion probability, while the rh-SINDy is a smooth approximation that shows comparable performance. We outline this framework below.

Method
We start with a set of time series measurements X [ R nÂd contaminated by measurement noise. We assume that our data are governed by the SINDy model _ x`¼ QðxÞJ and xð0Þ ¼ x 0 , ð3:1Þ for some sparse matrix of SINDy coefficients J and initial condition x 0 . Our goal is to determine the posterior distribution pðJ, x 0 , fjXÞ.
Step 1: Construct library. We posit a library Q : R d ! R l of candidate functions. We emphasize here that Q is a symbolic vector function of the system's state x. This is in contrast to the original SINDy algorithm, in which QðXÞ is a fixed matrix computed from the time-series data.
Depending on the library, solving the ODE in (3.1) for certain values of initial conditions and parameters may be unstable. Practically, this leads to exploding gradients with respect to SINDy coefficients and initial conditions, and integration steps taken by the ODE solver becoming negligibly small. To remedy this, we add a higher-order polynomial term with a small negative coefficient to the ODE model. For example, for a library of terms up to quadratic order, we add a cubic term, leading to the ODE model where ξ i,j is the i, jth element of J. The parameter ɛ is chosen to be sufficiently small so that the ODE is not affected for values of the system's state that lie within the range of the data, but sufficiently large so that _ x does not grow too large. In general, if the library Q includes polynomial terms up to order n, we add a term À1x nþ1 i if n is even, or À1x nþ2 i if n is odd. This guarantees that the values _ x remain finite for both positive and negative values of x.
Step 2: Construct model priors and model likelihood. Letxðt; J, x 0 Þ denote the SINDy prediction at time t for given values of J and x 0 , given bŷ Qðxðt 0 ÞÞJ dt 0 : For normally distributed measurement noise, the data likelihood takes the form For some cases, the values of X takes non-negative values, such as for populations, in which case we may choose to use a lognormal likelihood instead, We must choose priors for the noise level parameters σ j and the initial conditions x 0 . These priors are chosen using knowledge about the type of parameter (i.e. whether the parameter is non-negative) and the scales of the data.
In this work, we assume that the noise is uncorrelated in time and among the different state variables. In the case where these correlations exist, the model likelihood and prior can be adjusted. In particular, in the case where the state variables are correlated, [61,62] recommend replacing σ i with a matrix S [ R dÂd and using an LKJ prior. If there are correlations in time, for example, in the case of coloured noise, a Whittle likelihood, with an inverse χ 2 prior for the noise may be used [63].
Step 3: Choose a sparsity promoting prior for the SINDy coefficients. Following §2.3, for spike and slab SINDy (ss-SINDy) we use the hierarchical prior j i,j jl j N ð0, 1Þl i,j a i,j and l i,j BerðpÞ: royalsocietypublishing.org/journal/rsos R. Soc. Open Sci. 9: 211823 For regularized horseshoe SINDy (rh-SINDy), we use the hierarchical prior c 2 Inv-Gamma n 2 , n 2 s 2 and t C þ ð0, t 0 Þ: For ss-SINDy, we have that f consists of the noise-level parameter σ and the local shrinkage paramaters λ i,j . For rh-SINDy, f consists of σ, the λ i,j s, c and τ. The coefficients α i,j , which we choose as constants for this analysis, allow us to incorporate any knowledge about the scales of different parameters. For this work, we choose α i = 1 unless stated otherwise.
Step 4: Bayesian Inference. Once the priors and the data likelihood are specified, we employ MCMC to draw samples from the posterior distribution pðJ, x 0 , fjXÞ. Furthermore, we estimate the PPD (2.6) for the reconstruction and forecasting tasks of interest. We employ MCMC algorithms as implemented in the Python library PyMC3 [64]; specifically, for rh-SINDy we use the No-U-Turn Sampler (NUTS) [65], and for ss-SINDy we use the compound step sampler implemented in PyMC3.
In the UQ-SINDy framework, NUTS leverages the gradients of the SINDy model prediction xðt; J, x 0 Þ with respect to J and x 0 . These gradients are computed using Sunode [66], a Python wrapper for the CVODES library [67] for solving forward and adjoint ODE problems.

Examples and applications
In this section, we apply the spike and slab and regularized horseshoe priors in the UQ-SINDy framework and illustrate their performance on three examples: two synthetic datasets and one realworld dataset of lynx and hare populations. For each example, we quantify the likelihood of each term of the SINDy library belonging to the underlying dynamical equations, providing both an estimate of the inclusion probability and a distribution of likely values for each SINDy coefficient. We compare these results to the original SINDy algorithm and show that UQ-SINDy significantly outperforms SINDy in identifying the underlying dynamics for noisy observations.

Lotka-Volterra model
We first study data from the Lotka-Volterra model, also commonly referred to as the predator-prey model, which is a popular system used to model the interaction between two competing groups [68,69]. Originally developed by Lotka to model chemical reactions [70], the system has also been studied as a model in economics [71] and for biological systems [72][73][74]. We explore one real-world example in §3.2.3.
The Lotka-Volterra model is given by the two nonlinear differential equations For this example, we simulate the system with the initial condition [u 0 , v 0 ] = [10,5] and parameters α = 1, β = 0.1, γ = 1.5 and δ = 0.075, as in [75], which results in a periodic trajectory. We sample 50 snapshots over a time interval of t ∈ [0, 24]. Additionally, we contaminate this trajectory with lognormal multiplicative noise with distribution lognormal(0, 0.1), which corresponds for this dataset to approximately 10% additive noise. The lognormal distribution is non-negative and is commonly used to model observation errors for state variables restricted to non-negative values. The resulting time series is shown in figure 3, from which we see that the trajectory covers approximately four periods of oscillation. For this example, we normalize the data as a preprocessing step by dividing each time series (of x and y) by the standard deviation of the data. The normalized data is governed by a differential equation of the same form as the unnormalized data, but with modified parametersã ¼ 1,b ¼ À0:68,g ¼ À1:5 and royalsocietypublishing.org/journal/rsos R. Soc. Open Sci. 9: 211823 d ¼ 0:82. This preprocessing step can be beneficial for systems in which the parameters are of different orders of magnitude.
We apply UQ-SINDy for both the spike and slab prior (ss-SINDy) and regularized horseshoe prior (rh-SINDy). We use a library of polynomial terms Qðu, vÞ ¼ ½1, u, v, u 2 , v 2 , uv, resulting in a 6 × 2 matrix of SINDy coefficients J. The SINDy model then reads For the noise level and initial condition, we employ the priors s u , s v Lognormalðm ¼ À1, s ¼ 0:1Þ and u 0 , v 0 Lognormalðm ¼ 0, s ¼ 1Þ, respectively. In table 1, we present the inclusion probability (for ss-SINDy) and pseudo-probability (for rh-SINDy) of each term in the library. We see significantly higher probabilities for the four true non-zero terms compared to all other terms, indicating that both ss-SINDy and rh-SINDy correctly identify the structure of the governing equation. We note that although the inclusion pseudo-probabilities are not constrained between zero and one, the relevant terms are easily identified with values near to or greater than 1.
In figure 3, we present the marginal posterior distributions of the SINDy coefficient. From this, we immediately see that for both priors, the parameters that belong to the model have broad distributions centred about the true means, while the other eight terms have narrow peaks centred about 0. In table 1, we compare the posterior modes of the SINDy coefficients against the true values of the model parameters. We additionally apply the original SINDy algorithm to the data. We see that SINDy is unable to identify the correct dynamics due to the presence of observation noise. Furthermore, we note that due to their sparsifying behaviours, the posterior mode of the SINDy coefficients for both the spike and slab and regularized horseshoe priors are close to the true values.
In figure 3, we present the mean and 90% credibility interval of the PPDs of the UQ-SINDy reconstructions of the system's states. Furthermore, we also present the prediction using SINDy (solid      4). We then compute the PPD over the entire time interval [0, 48] by sampling from (2.6), and plot the mean and 90% credibility interval of this distribution. We find that the mean of the PPD is very close in value to the true values in the test set. Further, we note that some samples in the test set lie near the bounds of the credibility intervals. This shows that our credibility bounds are tight and accurately capture the uncertainty due to measurement noise.

Nonlinear oscillator and model indeterminacy
As a second example, we consider the damped nonlinear oscillator model of the form  figure 5. We use a library of polynomial terms Qðu, vÞ ¼ ½1, u, v, u 2 , v 2 , uv, u 3 , v 3 , u 2 v, v 2 u, resulting in a 10 × 2 matrix of SINDy coefficients J. Since the observation noise is normally distributed noise we employ the data likelihood in (3.3). For the noise level and initial condition, we employ the priors s u , s v Gammaða ¼ 1, b ¼ 0:1Þ and u 0 , v 0 Laplaceðm ¼ 0, b ¼ 1Þ, respectively. First, we apply SINDy to the data, resulting in the estimated SINDy coefficients presented in table 2. It can be seen that SINDy does not identify the relevant terms in the model or correctly estimate the values of the model parameters. In fact, none of the terms in the SINDy model are zero. This example is particularly challenging for SINDy because of the sparse data sampling, the size of the library and  royalsocietypublishing.org/journal/rsos R. Soc. Open Sci. 9: 211823  the large range of magnitudes of the non-zero coefficients (specifically, note that |α| and |δ| are much smaller than |β| and |γ|). Next, we apply ss-SINDy and rh-SINDy to these data. The posterior modes are shown in table 2. We present the marginal posterior distributions of the SINDy coefficients in figure 5. It can be seen that rh-SINDy correctly identifies the governing equation; specifically, we see that the marginal posterior distribution of the SINDy coefficients for the terms in the equation are centred away from zero, while the distributions of all other terms are sharply centred at zero. On the other hand, ss-SINDy identifies the four terms in the governing equation, while also identifying an additional mode corresponding to a model without the _ u : u 3 term but with the _ u : u 2 v and _ v : v 2 terms. These results are reflected in table 2, for which we show the posterior modes of the SINDy coefficients and the corresponding inclusion probabilities and pseudo-probabilities. For rh-SINDy, the four non-zero terms are clearly identified with modes close to the true values and inclusion pseudo-probabilities for the four terms close to one. For ss-SINDy, three of the terms are clearly identified with an inclusion probability close to one, while the terms _ u : u 3 , _ u : u 2 v and _ v : v 2 have inclusion probabilities of 0.5, 0.7 and 0.5, respectively. In figure 5, we present the mean and 90% credibility intervals of the PPDs of the reconstruction of the system's states, together with the training data. Similarly, in figure 6 we present the 90% credibility intervals of the PPDs of future state forecasting for testing data over the time interval (20,40]. Both rh-SINDy and ss-SINDy lead to similar credibility intervals for both reconstruction and forecasting. We note that the range of predicted model states is much narrower than for the Lotka-Volterra model, which is expected due to the lower noise level present in these measurements. We also emphasize that these PPDs are much tighter than those presented in [3] for this test case, even though we train rh-SINDy and ss-SINDy with substantially less data than in that work. Furthermore, it can be seen that the test data lie within the 90% credibility intervals of the PPDs of each state. Although some of the draws from the ss-SINDy PPD contain terms not in the model, the credibility intervals for both ss-SINDy and rh-SINDy are similar. This suggests that the ambiguity identified by the spike and slab prior is due to model indeterminacy inherent in these dataset.

ss-SINDy rh-SINDy
This indeterminacy can be attributed to the range of values spanned by the coefficients in the governing equation. In particular, the coefficients of _ u : u 3 and _ v : v 3 are an order of magnitude smaller than the coefficients of _ u : v 3 and _ u : u 3 . To further investigate this indeterminacy, we re-applied ss-SINDy and rh-SINDy with α i,j = 0.1 for the terms _ u : u 3 and _ v : v 3 . This scaling of the prior incorporates the knowledge that these two terms have coefficients of magnitude O(0.1). The resulting marginal posterior distributions, presented in figure 7, show that this scaling of the prior removes this ambiguity. The corresponding posterior modes and inclusion probabilities and pseudo-probabilities are presented in table 3.

Lynx-hare population model
As a final example, we apply ss-SINDy and rh-SINDy as described in §3.2.1 to model the population dynamics of two species in Canada. In particular, we consider data consisting of measurements by the Hudson Bay Company of lynx and hare pelts between 1900 and 1920 [75,76] (figure 8). The number of pelts for these two species is thought to be proportional to the true populations. Hares are a herbivorous relative of the rabbit, while the lynx is a type of wildcat whose diet depends heavily on hares. This predator-prey interdependence between the two species has been shown to be well  Figure 6. Forecasting using ss-SINDy (a) and rh-SINDy (b) for the nonlinear oscillator model. We train using samples from the nonlinear oscillator model over the time interval [0, 20] (red and blue crosses) and test on samples over the time interval (20,40] (black crosses). The mean (dashed lines) and 90% credibility intervals (dashed areas) of the PPDs are plotted for the entire time interval. characterized to first-order by the Lotka-Volterra model (3.5), where u and v correspond to the populations of hares and lynx, respectively. Empirically, the data have been found to be very noisy, with a noise level of approximately 25% [75]. Figure 8 presents the number of pelts recorded yearly for these two species over 21 years. Modelling these data with SINDy is particularly challenging because we have relatively few samples that cover only two cycles. In addition, factors such as the weather and the consistency of trapping between years adds uncertainty to the measurements. Here we compare the performance of ss-SINDY, and rh-SINDY for model discovery under uncertainty. The SINDy library, as in the Lotka-Volterra example, contains all constant, linear and quadratic terms. In addition, as a preprocessing step we normalize the data as described in §3.2. 1 The marginal posterior distributions computed using ss-SINDy and rh-SINDy are presented in figure 8. The posterior modes and inclusion probabilities and pseudo-probabilities are presented in table 4, together with maximum-likelihood estimates of the coefficients of the Lotka-Volterra model for the lynx-hare data [75], and estimates computed using the original SINDy algorithm. It can be seen that for ss-SINDy the distinct non-zero peaks correspond to the terms in (3.5). The likelihood of these four terms belonging to the model are very high. We additionally see a small peak near zero for _ u : u. This term is highly correlated with a non-zero constant term. We see a similar but more pronounced peak for rh-SINDy. Table 4 shows that ss-SINDy correctly identifies the Lotka-Volterra model and assigns high inclusion probabilities to the four terms in such a model. On the other hand, rh-SINDy identifies three of the four terms correctly. Furthermore, it can be seen that SINDy fails to identify the Lotka-Volterra model, and that the posterior modes for ss-SINDy and rh-SINDy are closer to the maximum-likelihood estimates than the SINDy estimates.
Last, in figure 8, we present the mean and 90% credibility intervals of the PPDs of the time series reconstruction. We note that all data lie within these credibility bounds. The SINDy reconstructions, on the other hand, appear to deviate from the time series for later times. spike and slab prior (a) royalsocietypublishing.org/journal/rsos R. Soc. Open Sci. 9: 211823

Conclusion and future work
In this work, we proposed UQ-SINDy, a new uncertainty quantification framework for identifying governing ODEs directly from noisy and sparse time series data. We leverage advances in model discovery for dynamical systems and sparsity promoting Bayesian inference to identify a sparse set of SINDy library functions that best explain the observed data and quantify the uncertainty in the SINDy coefficients due to measurement noise and the probability of inclusion of each term in the final model. We have applied UQ-SINDy to two synthetic examples and one real-world example of lynxhare population data. By using the spike-and-slab and regularized horseshoe priors, UQ-SINDy yields posterior distributions of SINDy coefficients with truly sparse draws, and thus results in truly sparse probabilistic model discovery; in contrast, the use of the Laplace prior does not lead to sparse model discovery. We observe that the proposed approach is robust against observation noise and can accommodate sparse samples and small datasets. Going forward, one of the primary limitations of this method is its scalability to very large SINDy libraries, such as libraries of rational functions, which are common in many biological and physical systems [24,25]. This is primarily due to the computational cost of sampling high-dimensional posterior distributions using MCMC. One remedy for this is to use variational inference, which matches classes of distributions to the posterior distribution by maximizing a lower bound on the marginal likelihood of the data. This method has been particularly effective for high-dimensional models, most notably neural networks, with comparable accuracy to sampling-based methods. Furthermore, in this work we are primarily focused on situations in which the coordinates that induce a sparse representation are known. However, in general this 'effective' set of coordinates may be unknown. Recent work merges SINDy together with neural network architectures in order to simultaneously learn parsimonious governing equations and the associated sparsity-inducing coordinate transformation [17]. Extending UQ-SINDy to this coordinate discovery framework could u · v · u · v · spike and slab prior regularized horseshoe prior